The purpose of this notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by New York Times, [NYT1].
The following steps are taken:
Ingest data
Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].
Merge the data.
Make data summaries and related plots.
Make corresponding geo-plots.
Note that other, older repositories with COVID-19 data exist, like, [JH1, VK1].
Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.
From the help of tolower:
capwords <- function(s, strict = FALSE) {
cap <- function(s) paste(toupper(substring(s, 1, 1)),
{s <- substring(s, 2); if(strict) tolower(s) else s},
sep = "", collapse = " " )
sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
dfNYDataStates <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv", stringsAsFactors = FALSE )
colnames(dfNYDataStates) <- capwords(colnames(dfNYDataStates))
head(dfNYDataStates)
summary(as.data.frame(unclass(dfNYDataStates)))
Date State Fips Cases Deaths
2020-03-28: 55 Washington : 77 Min. : 1.00 Min. : 1 Min. : 0.0
2020-03-29: 55 Illinois : 74 1st Qu.:17.00 1st Qu.: 7 1st Qu.: 0.0
2020-03-30: 55 California : 73 Median :30.00 Median : 75 Median : 1.0
2020-03-31: 55 Arizona : 72 Mean :30.89 Mean : 1467 Mean : 34.1
2020-04-01: 55 Massachusetts: 66 3rd Qu.:46.00 3rd Qu.: 565 3rd Qu.: 10.0
2020-04-02: 55 Wisconsin : 62 Max. :78.00 Max. :130703 Max. :4758.0
(Other) :1609 (Other) :1515
dfNYDataCounties <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv", stringsAsFactors = FALSE )
colnames(dfNYDataCounties) <- capwords(colnames(dfNYDataCounties))
head(dfNYDataCounties)
summary(as.data.frame(unclass(dfNYDataCounties)))
Date County State Fips Cases Deaths
2020-04-06: 2497 Washington: 503 Georgia : 2243 Min. : 1001 Min. : 0.00 Min. : 0.000
2020-04-05: 2444 Unknown : 494 Texas : 2097 1st Qu.:17099 1st Qu.: 2.00 1st Qu.: 0.000
2020-04-04: 2410 Jefferson : 375 Virginia : 1475 Median :28115 Median : 4.00 Median : 0.000
2020-04-03: 2357 Franklin : 337 California : 1462 Mean :29352 Mean : 74.45 Mean : 1.693
2020-04-02: 2291 Jackson : 305 Indiana : 1329 3rd Qu.:42103 3rd Qu.: 16.00 3rd Qu.: 0.000
2020-04-01: 2231 Montgomery: 300 North Carolina: 1311 Max. :56043 Max. :68776.00 Max. :2738.000
(Other) :23967 (Other) :35883 (Other) :28280 NA's :549
dfUSACountyData <- read.csv( "https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv", stringsAsFactors = FALSE )
head(dfUSACountyData)
summary(as.data.frame(unclass(dfUSACountyData)))
Country State County FIPS Population Lat Lon
UnitedStates:3143 Texas : 254 WashingtonCounty: 30 Min. : 1001 Min. : 89 Min. :19.60 Min. :-166.90
Georgia : 159 JeffersonCounty : 25 1st Qu.:18178 1st Qu.: 10980 1st Qu.:34.70 1st Qu.: -98.23
Virginia: 134 FranklinCounty : 24 Median :29177 Median : 25690 Median :38.37 Median : -90.40
Kentucky: 120 JacksonCounty : 23 Mean :30390 Mean : 102248 Mean :38.46 Mean : -92.28
Missouri: 115 LincolnCounty : 23 3rd Qu.:45082 3rd Qu.: 67507 3rd Qu.:41.81 3rd Qu.: -83.43
Kansas : 105 MadisonCounty : 19 Max. :56045 Max. :10170292 Max. :69.30 Max. : -67.63
(Other) :2256 (Other) :2999
dsNYDataCountiesExtended <-
dfNYDataCounties %>%
dplyr::inner_join( dfUSACountyData %>% dplyr::select_at( .vars = c("FIPS", "Lat", "Lon", "Population") ), by = c( "Fips" = "FIPS" ) )
dsNYDataCountiesExtended
ParetoPlotForColumns( dsNYDataCountiesExtended, c("Cases", "Deaths"), scales = "free" )
cf <- colorBin( palette = "Reds", domain = log10(dsNYDataCountiesExtended$Cases), bins = 10 )
m <-
leaflet( dsNYDataCountiesExtended[, c("Lat", "Lon", "Cases")] ) %>%
addTiles() %>%
addCircleMarkers( ~Lon, ~Lat, radius = ~ log10(Cases), fillColor = ~ cf(log10(Cases)), color = ~ cf(log10(Cases)), fillOpacity = 0.8, stroke = FALSE, popup = ~Cases )
Some values were outside the color scale and will be treated as NASome values were outside the color scale and will be treated as NA
m
An alternative of the geo-visualization is to use a heat-map plot.
Make a heat-map plot by sorting the rows of the cross-tabulation matrix (that correspond to states):
matSDC <- xtabs( Cases ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDC+1), cellnote = as.matrix(matSDC), scale = "none", dendrogram = "row", colors = "Blues", theme = "dark")
Deaths
Cross-tabulate states with dates over deaths and plot:
matSDD <- xtabs( Deaths ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDD+1), cellnote = as.matrix(matSDD), scale = "none", dendrogram = "row", colors = "Blues", theme = "dark")
TBD…